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Abstract. Inference of hidden classes in stochastic block 
model is a classical problem with important applications. 
Most commonly used methods for this problem involve naive 
mean field approaches or heuristic spectral methods. Recently, 
belief propagation was proposed for this problem. In this con- 
tribution we perform a comparative study between the three 
methods on synthetically created networks. We show that be- 
lief propagation shows much better performance when com- 
pared to naive mean field and spectral approaches. This ap- 
plies to accuracy, computational efficiency and the tendency 
to overfit the data. 



1 Introduction 

A large portion of the intriguing emergent phenomena of com- 
plex many particle systems is a consequence of the structure 
of interactions among their constituents. Bluntly, a soup of 
neurons does not have the same capabilities as a specifically 
woven neural net. Similar considerations apply to social sys- 
tems, information systems, biological systems or economical 
systems where the patterns of interaction are far from random 
and result in complex system-wide phenomena. 

Fueled by a flood of readily available relational data, recent 
years have seen a surge of research focused on structural prop- 
erties of networks as first step to understanding some of the 
properties of complex systems and ultimately their function 
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Interestingly, it is often much easier to map the network 
of interactions than to explain its function. A prime example 
of this phenomenon are protein interaction networks. Mod- 
ern biotechnology allows to automatize charting the matrix 
of pairwise binding relations for all proteins produced by an 
organism, i.e. do two proteins form a stable link or not |23| . As 
proteins generally operate in complexes (agglomerates of sev- 
eral proteins) such a network of pairwise interactions encodes 
latent information about protein function. Hence, it makes 
sense to use network structure to make inferences about pro- 
tein function or plan and guide other wet-lab experiments 
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aimed at elucidating function [T5]. Similar considerations ap- 
ply to the analysis of social networks where interactions are 
recorded in online data streams but information on the prop- 
erties of the actual agents remains hidden behind pseudonyms 
or avatars 25 . 

Hence, the hypothesis behind network analysis is that nodes 
in a network which have similar patterns of interaction are 
likely to have common properties or perform similar function. 
Discovering topological similarities and differences thus hints 
at the existence of possible latent features of the nodes in the 
network that merit further analysis. 

Being a first step to more detailed analysis, such ex- 
ploratory analysis is often highly consequential. It is impor- 
tant to thoroughly understand the algorithms used in every 
detail and to be aware of possible limitations and pitfalls. 
This contribution aims at raising this awareness using the sim- 
ple example of inferring the parameters of a Poisson-mixture 
model, the so-called Stochastic-Block-Model (SMB) [9]|6], in 
undirected unweighted unipartite networks. The conclusions 
we draw, however, extend well beyond this example and we 
discuss these consequences at the end of the paper. 

Our contribution is then organized as follows: first we intro- 
duce the stochastic block model as a way to capture density 
fluctuations in relational datasets and infer latent variables. 
Next, we discuss the theoretical limitations that any inference 
technique for such a model must face: namely a sharp tran- 
sition between a parameter region where inference is feasible 
and a parameter region where inference is impossible. Third, 
we briefly review spectral approaches and the Expectation 
Maximization (EM) algorithm in conjunction with the naive 
mean field approach. We then introduce a formulation of the 
EM algorithm based on belief propagation. Fourth, we com- 
pare the performance of these three approaches on ensembles 
of benchmark networks from a region near the above men- 
tioned feasible-infeasible transition in the parameter space. 
In this region, particularly difficult problem instances can be 
found that allow to highlight performance differences. Finally, 
we discuss our findings and the possible extensions and con- 
sequences to other models and inference tasks. 

2 The Stochastic Block Model 

The simplest model of a network of N nodes and M undi- 
rected unweighted edges between them is a an Erdos-Renyi 
graph. It assumes that a link falls between any pair of nodes 
with constant probability pij = po ~ 2M/[N(N — 1)], 
independently of whether links exist between other pairs of 



nodes. Consequentially, large networks with low link density 
po generated by this model have a Poissonian degree distri- 
bution with mean degree (fc) = po(N — 1). This model can al- 
ready explain two main characteristics of real world networks 
- their small world property of short average path lengths 
and their connectedness even at low densities. Unfortunately, 
it cannot explain much more. In particular, it fails to capture 
the large variance of link densities between groups of nodes 
observed in many networks. 

In real world networks, not all nodes are created equal and 
may represent entities of very different properties or functions. 
Whether two nodes are linked often depends on these proper- 
ties. Consider the example of protein interaction again. Mem- 
brane proteins will certainly bind to other membrane proteins 
to form stable membranes, but, for example, the enzymes in- 
volved in various catalytic reactions should not stick to the 
cell membrane since otherwise the interior of the cell would 
soon be depleted of these essential molecules [19] . In an en- 
tirely different social context, one will certainly observe social 
interactions correlated with the agents' age, gender, possibly 
income or education. Social ties will depend on these qualities 
an thus network structure is indicative of node properties and 
may be used to make corresponding inferences. 

One of the simplest models capable of capturing the depen- 
dence of link probability on node properties is the Stochastic 
Block Model [5]. It assumes that each node i g {1,..,7V} is 
of one and only one of q classes and ti = r is indicating the 
membership of node i in class r £ {1, .., q}. As before, nodes 
are linked independently, but now, the probability of node i 
linking to node j depends on ti and tj alone, i.e. pij = ptitj- 
One can easily write down a probabilistic generative model 
for this sort of network. First, we assume that nodes are as- 
signed into q classes randomly by a multinomial distribution 
with parameters V(U — r) = p r . Next, we specify the matrix 
of link probabilities between classes p rs £ (0, l) qxq . Our set 
of parameter thus comprises of 9 — {q,p r ,p rs }. The proba- 
bility of generating a specific {0, l} iVxJV adjacency matrix A 
together with a specific assignment of nodes into classes t is 
then given as: 

p(A,ti0) = n [ptyv-pti*,? 1 -^] n*« o) 

The expected average density of links in such a network is 
po = X^rs PrPrsPs- If we were able to observe the adjacency 
matrix A and class memberships t at unknown parameters, 
equation {TJ would give us the complete data likelihood of the 
parameters 8. It is then easy to estimate the parameters 8* 
which maximize {TJ): 

Pr= wEA.r (2) 

P rs = Npr(Nps-Srs) ^i<3 Aij5 ti ,r5 tj ,s 

With {TJ being a member of the exponential family, these es- 
timators are consistent, efficient and the model is identifiable, 
i.e. the maxima are unique. In this contribution we always 
assume that the correct number of classes q is known. 

However, in practical applications as discussed, the situa- 
tion is often that we only have access to the adjacency matrix 
A but not to the class labels t which are our primary in- 
terest for explaining network structure and possibly function. 
Fortunately, under certain circumstances we can still draw 



conclusions about these hidden variables using the toolbox of 
statistical inference. What these circumstances are and how 
this is usually done will be discussed in the following two sec- 
tions. 

3 General Considerations 

It is clear that the task of inferring the unobserved latent 
variables is only possible if the preference matrix p rs shows 
sufficient "contrast" . If all entries were the same, i.e. p ra = po, 
then of course no method can perform any inference on the 
hidden variables. Conversely, if p rs = PoS r , s , then the network 
practically consists of several disconnected components and 
inference reduces to the trivial task of identifying the compo- 
nent to which an individual node belongs. Between these two 
extremes of impossible and trivial, there is a sharp phase tran- 
sition [201 |3j [2] . It divides the parameter space into a region 
where it is provably impossible to infer the latent variables 
with an accuracy higher than guessing and a region where it 
is possible with high accuracy. 

Theoretical analysis has shown that the transition persists 
in infinitely large networks when they are sparse, i.e. the av- 
erage degree per node does not grow with the system size. In 
other words, networks in which the elements of p ra scale as 
1/N. In contrast, for dense networks in which p rs does not 
scale with N, considering larger networks means considering 
proportionally larger average degrees and this will render even 
very small amounts of variance in p ra detectable and thus lets 
the region of impossible inference vanish [18] . 

In real applications, we cannot generally increase network 
size at constant parameters. We will observe both the region 
of impossible and possible inference. However, the parameter 
region of impossible inference will be smaller for denser net- 
works, i.e. those with higher average degree. Further, it has 
been shown that networks with parameters in the vicinity of 
the transition point are the instances in which inference is 
hardest [3J|5]. 

As it is our aim to highlight performance differences be- 
tween different inference techniques for the SBM, we will focus 
our attention on instances in sparse graphs near the transition 
from impossible to possible inference. Before we come to this 
analysis, however, we will introduce the contestants. 

4 Inferring Stochastic Block models 

When inferring latent structure in data, one can take the route 
of statistical inference if one can justify a statistical model to 
fit to the data as we've done with the SBM. It may also be 
sensible to use a simple dimensionality reducing heuristic. We 
consider both of these approaches. 

4.1 Spectral Approaches 

When dealing with high dimensional data such as networks 
and searching for common patterns of interactions, a natural 
strategy is to try reducing the dimensionality in such a way 
that nodes with similar interaction partners are mapped to 
positions in some low dimensional space, while nodes with 
very different interaction partners should be positioned far 
apart. One then uses standard clustering algorithms, such as 
fc-means in our case, originally developed for multivariate data 



and to analyze the nodes in their low dimensional embedding. 
This is the strategy behind all spectral techniques of network 
analysis. 

Let us consider the adjacency matrix A as a list of N mea- 
surements in an iV-dimensional feature space, each row de- 
scribing one node in N dimensions, namely, its relations to 
the other nodes. We could then apply a variant of multidimen- 
sional scaling such as principal component analysis (PC A). 
We would subtract the means of the measurements in each 
dimension, calculate the covariance matrix and find the di- 
rections of maximum variance by an eigen-decomposition of 
the co-variance matrix. Finally, we would project our data 
matrix onto the first q principal components, i.e. those eigen- 
vectors of the covariance matrix corresponding to the largest 
eigenvalues. 

A method similar in spirit has been introduced specifically 
for networks [T5]. It differs from PC A only slightly in that 
it not only removes the means of the rows, but, since A is 
symmetric, also the means of the columns. This is to say, the 
original matrix A is transformed into a so called modularity 
matrix B via ^ 

By = Atj — -yjj- . (3) 

This modularity matrix B now has row-sums and column- 
sums zero. Note that the terms kikj/2M <C 1 for sparse net- 
works. Since B is symmetric, the eigenvectors of a correspond- 
ing "covariance matrix" C = BB T are the eigenvectors of B 
and hence the projection of the modularity matrix onto the 
"principal components" is given directly by the components 
of the eigenvectors corresponding to the largest magnitude 
eigenvectors of B. This approach has recently been claimed 
to be no worse than any other approach 13 and we will eval- 
uate this claim in this paper. 

Another aspect of this method is worth mentioning. It is 
known that the best rank-q approximation to a symmetric 
matrix is given by its eigen-decomposition retaining only the q 
eigenvalues largest in magnitude. "Best" here means in terms 
of reconstruction error under the Frobenius norm. If V is a 
matrix the columns of which are the eigenvectors of B or- 
dered by decreasing magnitude of the corresponding eigen- 
value, then the entries of the optimal rank-g approximation 
B' will be given by 

9 

By =^VirKV jr . (4) 
r=l 

So we see that BL is large when the rows i and j of V are par- 
allel and all the considered A r with r £ {1, .., q} are positive. 
In contrast, if all A r are negative, rows i and j of V should 
be anti-parallel to make B\j large. Large positive eigenvalues 
are indicative of block models with some p rr large while large 
negative eigenvalues are indicative of block models with some 
p rr small in comparison to the average density of the network 
po- We can conclude that when these cases mix, it will gener- 
ally be very difficult to find an embedding that maps nodes 
from a network with similar interaction patterns to positions 
that are close in space using spectral decomposition of the 
modularity matrix. 

Instead of using an embedding that minimizes a reconstruc- 
tion error, one can also introduce a pairwise similarity mea- 
sure based on the network topology and then find an embed- 
ding of the N x N similarity matrix such that "similar nodes" 



are "close" . This approach is implemented in the widely used 
diffusion- map [11] , 

Assume a random walker is traversing the network. When 
at node i, the walker will then move to any node j ^ i with 
probability p^ t — Aij/ki. Here, ki = ^ Aij i s the number of 
neighbors of node i. We can identify in pju as the entries of 
an N x N row stochastic transition matrix P = D _1 A where 
D is a diagonal matrix with Da = fcj. The probability that 
the random walker, after starting in node i, reaches node j 
in exactly t steps is then given as Pt(j\i) = Py- The station- 
ary distribution of the random walker on the N nodes of the 
network is given by 7Tq = limt-^oo Pt(i\j) = ki/2M. Equipped 
with these definitions, one can define a "diffusion distance" 
between nodes i and j via 

A 2 (i,j) = E (pt(fc|?) T mr - ( 5 ) 

This is a sensible measure of topological distance between 
nodes i and j as it measures a difference in the distributions 
of arrival sites when the random walker starts from either i 
or j. One can find an optimal embedding such that the Eu- 
clidean distance in the low dimensional space matches the 
diffusion distance to any desired precision. The coordinates 
of this embedding are given by the entries in the eigenvectors 
corresponding to the q largest non-trivial right eigenvectors 
of P scaled by the corresponding eigenvalue to power t. Since 
the largest right eigenvalue of P is always Ai = 1 and the cor- 
responding eigenvector is constant, it is considered trivial. If 
a match to relative precision 5 is required we must include all 
eigenvectors v,- of P with |A,-|* > <5 1 A2 1 * where the A are the 
right eigenvalues of P. As all eigenvalues of P are smaller in 
magnitude than 1, A2 dominates for large t and thus the large 
scale structural features. In this case, large negative eigen- 
values are not a problem, since the embedding is such that 
Euclidian distance between the positions of the nodes in the 
low dimensional space approximates the topological distance 
and not the scalar product dressed with the eigenvalues as in 
the case of the spectral decomposition. 

4.2 Expectation Maximization 

The goal of maximum likelihood inference aims to estimate 
parameters for a generative model such that the observed data 
becomes maximally likely under this model. Our generative 
model ([TJ gives us the probability of observing the network 
and the node classes. If only the network is observed we need 
to trace out the node classes. Specifically, we seek 

0* = argmax 9 £(0) = log ^ V(A, t\6). (6) 
t 

The sum over all possible assignments of nodes into latent 
classes is computationally intractable and so one resorts defin- 
ing a lower bound on the log-likelihood C{9) which can be 
both evaluated and maximized. This bound is know as the 
Free Energy 

7{V{t),0) = J2 V(t) log V(A, t\0) - J2 ^(t) log P(t). (7) 

t t 

The Free energy is a functional of a distribution over the 
latent variables P(t) and the model parameters 6. It is easily 



shown that T is indeed a lower bound on L(ff): 

T(V{t), 9) = -D KL (P(t)\\T(t\A, 9)) + C{9). 



(8) 



and that if T has a (global) maximum in (V*(t),9*) then 
C(9) also has a (global) maximum in 9* [14] ■ The procedure 
for maximizing T in turn with respect to its two arguments is 
known as the Expectation Maximization algorithm 4; . Specif- 
ically, maximizing T with respect to V(t) at fixed 9 is known 
as the "E-Step" , while maximizing T with respect to 9 at fixed 
V(t) is known as the "M-Step". Ideally, the E-step tightens 
the bound by setting f>{t) = V(t\A, 9), but for our model fTJ 
the calculation of V(t\A, 9) is also intractable. Note that this 
is in contrast to estimating the parameters of a mixture of 
Gaussians where, for observed data X, we can easily evaluate 
V(t\X,9). 

Two routes of approximation now lie ahead of us: the first 
one is to restrict ourselves to a simple factorizable form of 
P(t) = Y[ t V{ti) which leads to the mean field approach. The 
second route leads to belief propagation. 

4.3 E-Step and M-Steps using the naive 
mean field 

We shall start by the mean field equations as used for the 
SBM for instance in pQ or [8]. In addition to the assumption 
of a factorizing V(t), one introduces the following shorthand: 
=V(ti — r). Then, the free energy in the naive mean field 
approximation is given by 

-FMF = Y.iKj.r. ( A H log 1^7 + l°g(l ~ Pr.)) titi 

+ E iir .V4(logPr-log^) (9) 

This free energy is to be maximized with respect to the ipr by 
setting the corresponding derivatives to zero and we obtain 
a set of self-consistent equations the ipl have to satisfy at 
V il T= 0: 



Vv = 



(10) 



K = Ei*,. Aij log j^-i>i +J2 a (N- S rs ) Pa log(l - Prs ) 

The beauty of this approach is its apparent computational 
simplicity as an update of V(t) can be carried out in 
0(N(k)q 2 ) steps. Setting VgJ-MF equal to zero and observ- 
ing the constraint that ~}2 r p r ~ 1, we derive the following 
equations for the M-step: 



(11) 



Note the similarities between eqns. (|11[) and l[2]l. 

4.4 E-Step and M-Steps using Belief 
Propagation 

Belief propagation equations for mixture models were used by 
several authors, see e.g. [71 1221121] . Several important nuances 
in the algorithm make us adopt belief propagation algorithm 
for SBM as developed in [31 [5] , the implementation can be 
dowloaded at |http://mode_ne t. krzakala .org/ [ 



There are several ways one can derive the Belief Propaga- 
tion equations (see for instance [26] V One way is from a re- 
cursive computation of the free energy under the assumption 
that the graphical model is a tree. Application of the same 
equations on loopy graphical models is then often justified by 
the fact that correlations between variables induced by loops 
decay very fast and are hence negligible in the thermodynamic 
limit. In the case treated here, even when the adjacency graph 
Aij is sparse, the graphical model representing the probabil- 
ity distribution P is a fully connected graph on N nodes. 
However, for sparse networks the interaction for nodes that 
are not connected by an edge is weak 1 — p rs « 1 and the net- 
work of strong interactions is locally tree-like. This puts us 
in the favorable situation of decaying correlations. This was 
used in [3l [2] to argue heuristically that in the limit of large 
N the belief propagation approach estimates asymptotically 
exact values of the marginal probabilities tpl and of the log- 
likelihood, in a special case of block model parameters this 
has been proven rigorously in [12] . 

To write the belief propagation equations for the likelihood 
U} we define conditional marginal probabilities, or messages, 
denoted = V(U = r\A\Aij,9). This is the marginal 

probability that the node i belongs to group r in the ab- 
sence of node j. In the tree approximation we then assume 
that the only correlations between i's neighbors are mediated 
through i, so that if i were missing — or if its group assignment 
was fixed — the distribution of its neighbors' states would be a 
product distribution. In that case, we can compute the mes- 
sage that i sends j recursively in terms of the messages that 
i receives from its other neighbors fe O [2] : 



K^ 3 = Y.h&j log 



(12) 



(13) 



The marginal probability tpl is then recovered from the mes- 
sages using (|10|l and 



K = log 



£ i 



(i 



(14) 



Compared with equations (|10[) . updating the belief propaga- 
tion equations takes 0(N 2 q 2 ) steps. 

Most real world networks, however, are relatively sparse, 
i.e. the number of edges is much smaller than iV 2 . For such 
cases the BP equations can be simplified. To see this we con- 
sider c rs = Np r3 — O(l), in the limit N — > oo terms o(N) 
can be neglected as in [2], one then needs to keep and update 
messages tpl^ 3 only when Aij — 1. The update equation for 
field hl^ j then is 



l0g ( ^ Crs^' )-JfJ2^2 C -V>* 

k£8i\j \ s / fc = l s 



(15) 



where di denotes i's neighborhood. In order to get the 
marginal probability ipl one uses eq. (|10|l and 



(16) 



^ log ( ^ Crs4>s^ 1 ) ~ Jj Yl ^ C rs ^, 



kedi 



Note that it is possible to implement the update of all fields 
h\ in 0(N(k)q 2 ) steps, thus making the BP approach as fast 
the the naive mean field. In order to do that, we compute 
the second term in eq. (|15f) once at the beginning and then 
we only add and subtract the contributions to this term that 
changed. 

Once the fixed point of BP equations is found, one uses the 
Bethe formula to compute the free energy [55] 

(ij)eE i \ s / 

where 

r,s 

Again the Bethe free energy is exact if the graphical model is a 
tree and is a very good approximation to the true free energy 
in many practical cases, and often a much better one than 
the MF free energy. An important point is that the Bethe free 
energy is not guarantied to be a bound on the log-likelihood. 

Setting VflJ-BP equal to zero and observing that the BP 
equations are stationarity conditions for the Bethe free en- 
ergy, one derives the following equations for the M-step of 
expectation maximization 

i 

~" Np r p s ^ 

5 Performance Comparison 

We will compare the performance of the three approaches pre- 
sented in the last section on ensembles of test networks which 
have been generated from ([T]). Hence, we know the true assign- 
ment of nodes into classes ti for all nodes i G {1, .., TV}. Let 
us denote by t* the estimates of group assignment that follow 
from the above algorithms. A simple performance measure is 
then the "overlap" between {ti} and {£*} defined as 

Qsi-maxV^*,^))- (19) 

i 

Since the ti can only be recovered up to permutation of the 
class labels, the maximum over all possible permutations of n 
on q elements is taken. Note that a trivial estimate would be 
t* — argmax r p r Vi. Hence, only values of Q > max r p r should 
be considered as successful inference. 

5.1 Belief Propagation vs Mean Field 

To make a comparison of BP and MF we will assume in both 
approaches that the parameters p r , p rs , and the right number 
of groups q are known. Both approaches output the estimates 
of marginal probabilities tp^.. In order to estimate the origi- 
nal group assignment, we assign to each node its most-likely 
group, i.e. 

t* — argmax r -(/v . (20) 

If the maximum of tp^. is not unique, we choose at random from 
all the qi achieving the maximum. We refer to this method of 



estimating the groups as marginalization. Indeed, a standard 
result show that it is the optimal estimator of the original 
group assignment {ti} if we seek to maximize the number of 
nodes at which ti = t*. 

In practical situations, when the true assignment is not 
known, one can also use the estimates of the marginal prob- 
abilities 1])% to compute the confidence of the method about 
the estimate t* defined as 

i 

An important remark is that if the marginals %p* were evalu- 
ated exactly then in the large TV limit the overlap and con- 
fidence quantities agree, C = Q. In our tests the quantity 
C — Q hence measures the amount of illusive confidence of 
the method. Values of C — Q larger than zero are very un- 
desirable as they indicate a misleading correlation, and give 
an illusive information on the amount of information recon- 
structed. 

To compare the performance of BP and MF, we generated 
networks from the "four groups test" of [TS] with a large 
number of variables TV, four groups q — 4, average degree 
c = Po/N, and ratio e between the probability of being con- 
nected to a different group and within the same group. In 
other words, e — c out /ci n . See an example adjacency matrix 
in Fig. [T] The results of inference using BP and MF are plot- 
ted in Fig. [21 From Fig. [5] we see several important points in 
which BP is superior over MF 

• BP estimate gives better agreement with the true assign- 
ment. In the left and right part of Fig. [2] we see the fol- 
lowing. In a region of large overlap, the two methods give 
the same result. This can be understood from the form of 
the BP and MF equations that become equivalent for very 
polarized marginals ip x r . In the region of very small overlap 
both approaches converge to a fixed point that does not 
contain any information about the original group assign- 
ment. However, for parameter values close to the possible- 
impossible-inference phase transition the BP method gives 
larger overlap with the true group assignment than MF. 

• BP is not over-confident. In the left and right part of Fig. [2] 
we compare the true overlap to the confidence value (|2ip . 
For BP the two agree, just as they should if the marginals 
were evaluated exactly. In the MF approach, however, the 
confidence is considerably larger than the true overlap. This 
means that in the whole region where C — Q > 0, MF is 
misleadingly confident about the quality of the fixed point 
it found. The width of this region depends on the parameter 
values, but we observed that a good rule of thumb is that 
if the overlap reached is not very close to 1, then the MF 
method is unreliable. 

• BP is faster. As we explained when we exposed the BP 
and MF equations, one iteration takes a comparable time 
for both methods. In the middle part of Fig. [2] we plot the 
number of iterations needed for convergence, we see that 
again around the phase transitions region MF needs more 
iterations to converge, and hence is overall slower that BP. 

• BP does not converge to several different fixed points. 
Starting with randomly initialized messages, BP converged 
to the same fixed point (up to small fluctuations) in all the 
runs we observed. On the other hand in the region where 



Figure 1. Adjacency matrices representing the block structure used for generating the various examples of the block model eq. Q in 
this contribution. Rows and columns are ordered such that rows/columns corresponding to nodes with the same ti are next to each other. 
From left to right: a q = 2 modular network, a core-periphery structure, and a q = 4 modular network. 
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Figure 2. Comparison between the naive mean field (MF) and belief propagation (BP) approaches to the E-step of expectation maxi- 
mization. All datapoints correspond to networks with N = 10 4 nodes. The networks were generated using q = 4 groups, modular structure 
as sketched in left part of Fig.[T] and c Tr = c; n > c rs = c ou t Vs ^ r. Left: True and illusive overlap Q and C for inference of the group 
assignment at different values of e = c ou t/c; n . Note the transition between a phase where inference of class membership is possible and 
where it is not at e c = 0.43. Also note that MF is overfitting the data, showing large illusive overlap C in the region where inference is in 
fact impossible. Middle: The number of iterations needed for convergence of the E-step for the problem instances from the left part (we 
set the maximum number of iterations to be 1000). The computational effort is maximal at around e c for both methods, but BP converges 
faster. Right: True and illusive overlap Q and C at different values of the average connectivity c = (k) at fixed e = 0.35. Again, we observe 
a transition between feasible and infeasible inference at (k) c (e) and the over- confidence of MF in the infeasible region. 



the MF value of confidence C differs from the true overlap 
Q MF converged to several different fixed points depending 
on the initial conditions. 

To summarize, BP for block model inference is superior to 
MF in terms of speed, of quality of the result and does not 
suffer from over-confidence the way MF does. Note that sim- 
ilar conclusions about BP compared to MF were reached for 
other inference problems in e.g. [241 122] . 

An important point is that so far, have have discussed the 
situation of BP and MF algorithms using the known and cor- 
rect values of parameters p r , Prs in the E-step of expectation 
maximization. Concerning the M-step, we observed without 
surprise that the expectation maximization with BP gives bet- 
ter results than with MF in the region of parameters where 
BP is superior for the E-step. Otherwise the performance 
was comparable. Notably, both the approaches suffer from 
a strong dependence on the initial conditions of the param- 
eters This is a known problem in general expectation 
maximization algorithms [ID]. The problem comes from the 
fact that the log-likelihood C(8) has many local maxima (each 
corresponding to a fixed point) in 9 in which the expectation 
maximization update gets stuck. Fortunately the free energy 
serves as an indicator of which fixed point of EM is better. 



Hence a solution is to run the EM algorithm from many differ- 
ent initial conditions and to consider the fixed point with the 
smallest free energy (i.e. largest likelihood). Since the volume 
of possible parameters does not grow in the system size N, 
this still leads to an algorithm linear in the system size (for 
sparse networks). However, the increase in the running time 
is considerable and smarter initializations of the parameters 
p*7° are desired. We introduce one such in the next section. 

5.2 Spectral methods 

Methods based on the eigenvectors of the adjacency matrix 
of the network provide one of the most flexible approaches of 
graph clustering problems applied in the practice and hence 
we compare the BP algorithm to this approach as well. The 
comparison of BP with modularity matrix based and random 
walker based spectral methods gives the following conclusions: 

• In the case when the parameters 8 are known and we search 
for the best estimate of the original group assignment we 
observed that BP is always better than the two spectral 
clustering algorithms (that is the random walker based and 
the modularity based one) that we tested. This is illustrated 



Random walk SP + 

BP 

Modularity SP 




1 r- 

0.9 
0.8 
0.7 
0.6 

0.5^ 




Random walk SP 
BP 

>/JVIodularity SP 



0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 

E= Cniit/Ci n i 



6 8 10 

average degree 



12 



Figure 3. Comparison of the BP (only the E-step) and spectral clustering based on the random walker approach and the modularity 
matrix (see text). Datapoints correspond to networks with N = 10 6 nodes for the two spectral approaches and N = 10 5 for BP, and 
average degree c = (k) = 3. The networks of q = 2 groups are generated using modular structure as sketched in the left part of Fig.[T] To 
ensure the random walk based method to work, we extracted the largest connected component of the network and ran the algorithm on 
it. Left: The overlap Q at different values of e = c ou t/ci n . Note how the spectral approaches can only correctly recover the latent class 
labels deep in the feasible region of the parameter space. Right: The overlap Q at different values of the connectivity c at fixed e = 0.3. 
Again, the spectral methods can only identify the latent class labels for problem instances well within the feasible region and fail on the 
hard instances near the critical connectivity. 
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Figure 4. Left: An example where the EM with BP when initialized in a random matrix c a f, does not work, whereas the random walker 
spectral method works well. The result of the spectral method serves as a initialization of c a j, in the EM BP, which then improves the 
achieved overlap. Modular network of size N = 10 5 generated with q = 4 groups and e = 0.35. Right: An example where EM with BP 
works well even from random initial condition for the matrix c a j,, while spectral methods do not work well at all. The network exhibits 
a core periphery structure (middle panel of Fig[TJ of size N = f0 4 . Here average degree of core variables is equal to average degree of 
periphery variables. There are two groups of sizes p a = 2/3 and p;, = 1/3, c a j, matrix is in form of {ci n , c; D ; Ci , c ou t}, with Ci n = gr^, 
tout = ec; n and Ci a = 1 — 0.5e. The modularity based method gives overlap 2/3, because all variables were assigned to group 1. 



in Fig. [3] and [4] In some cases (e.g. Fig. [4] left) the improve- 
ment BP provides over spectral methods is marginal. In 
other cases, e.g. for the core-periphery network of Fig. [4] 
right the improvement is drastic. 

A particularly important point we want to make is the fol- 
lowing: For the cases tested in this paper the spectral meth- 
ods are clearly suboptimal: there are regions where the BP 
inference gives large overlap while spectral clustering meth- 
ods do not do better than chance. See for instance Fig.[3]left 
for 0.1 < e < 0.268. Recently authors of [13] claimed "No 
other method will succeed in the regime where the mod- 
ularity method fails", it was mentioned that their results 
may not be valid for networks with small average degree. 
Here we clearly show that for networks with small average 
degree the spectral methods are indeed not optimal. In our 
opinion, the conclusions of |13| apply only when the average 



degree diverges with the system size. 

A final point is that the spectral method should thus not 
be thought as the end of the story, but rather as the begin- 
ning: Indeed, they are extremely useful as a starting point 
for initializing EM BP to achieve improved overlap. This is 
shown in Fig. [4] left where EM BP starts from parameters 
taken from the result of the random walker based spectral 
method. This clearly improves the quality of the inference 
without having to restart EM BP for many initial condi- 
tions. 



6 Conclusions 

Using the example of latent variable inference in the stochas- 
tic block model of complex networks, we have compared belief 
propagation based inference techniques with traditional mean 



field approaches and classic spectral heuristics. To this end, 
we have used the recent discovery of a sharp transition in the 
parameter space of the stochastic block model from a phase 
where inference is possible to a phase where inference is prov- 
ably impossible. In the vicinity of the phase transition, we find 
particularly hard problem instances that allow a performance 
comparison in a very controlled environment. 

We could show that though spectral heuristics are appeal- 
ing at first for their speed and uniqueness of the resulting 
decompositions, they only work reliably deep within the pa- 
rameter region of feasible inference. In particular, very sparse 
graphs are difficult for spectral methods, as are block struc- 
tures that are more complicated than a mere collection of 
cohesive subgraphs or communities. In short, they serve as a 
"quick and dirty" approach. We also evaluate if recent claims 
on the optimality of spectral methods for block structure de- 
tection hold for networks with small average degree [13] . 

Comparing naive mean field techniques with belief prop- 
agation techniques, we find that the computational burden, 
which has so far hindered the wide spread use of belief propa- 
gation in fully connected graphical models such as block struc- 
ture inference of (sparse or dense) networks, has been lifted 
completely. Not only is the computational complexity of the 
variable updates the same, belief propagation also exhibits 
much better convergence properties and this in particular on 
the hard problem instances. Hence, we expect that the pre- 
sented formulations of belief propagation equations may find 
a wide range of application also in other fields of inference 
with fully connected graphical models. Note that the regime 
of p rs = 0(1/ N) considered here corresponds to the max- 
imally sparse case. BP will still outperform other methods 
when p ra = 0(N~ a ) with a < 1, albeit the performance dif- 
ferences will be much smaller. 

Finally, we could show that using spectral decompositions 
in order to select initial conditions for learning the parameters 
of the stochastic block model can be a viable step in order 
to reduce the dependency on initial conditions when used in 
conjunction with expectation maximization type algorithms. 
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